Part 3: Classification

iiiA) GLM

Load packages

Reload model - The primary data set to work with in this part is loaded in the code chunk below.

Let’s now load in that model, but assign it to a different variable name. We can read in an .rds file with the readr::read_rds() function.

train_dataset <- readr::read_rds("my_train_data.rds")
train_dataset |> glimpse()
## Rows: 835
## Columns: 15
## $ R             <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 14…
## $ G             <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, …
## $ B             <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132,…
## $ Lightness     <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", …
## $ Saturation    <chr> "bright", "bright", "bright", "bright", "bright", "brigh…
## $ Hue           <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 1…
## $ response      <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, …
## $ outcome       <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ LightnessNum  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SaturationNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ y             <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.09074…
## $ logit_R       <dbl> 0.72865387, -2.17562547, 0.72865387, -2.09274551, 0.6931…
## $ logit_G       <dbl> -Inf, -1.62924054, -1.40691365, -1.66965677, -3.08534443…
## $ logit_B       <dbl> -2.4624338, 0.1784828, -2.7950616, 0.1996131, -2.7950616…
## $ logit_Hue     <dbl> -2.36712361, 1.79175947, -1.38629436, 2.04769284, -2.047…
df <- readr::read_rds("df.rds")
viz_grid <- readr::read_rds("viz_grid.rds")

Use glm() to fit generalized linear models:

A. Intercept-only model – no INPUTS! B. Categorical variables only – linear additive C. Continuous variables only – linear additive D. All categorical and continuous variables – linear additive E. Interaction of the categorical inputs with all continuous inputs main effects F. Add categorical inputs to all main effect and all pairwise interactions of continuous inputs G. Interaction of the categorical inputs with all main effect and all pairwise interactions of continuous inputs

3 models with basis functions of your choice:

H. Try non-linear basis functions based on your EDA. I. Can consider interactions of basis functions with other basis functions! J. Can consider interactions of basis functions with the categorical inputs!

# A. Intercept-only model:
modA <- glm(outcome ~ 1, data = train_dataset, family = binomial)

#B. Categorical variables only – linear additive
modB <- glm(outcome ~ Lightness + Saturation, data = train_dataset, family = binomial)

# C. Continuous variables only – linear additive
modC <- glm(outcome ~ R + G + B + Hue, data = train_dataset, family = binomial)

# D. All categorical and continuous variables – linear additive
modD <- glm(outcome ~ Lightness + Saturation + R + G + B + Hue, data = train_dataset, family = binomial)

# E. Interaction of the categorical inputs with all continuous inputs main effects
modE <- glm(outcome ~ Lightness * R + Lightness * G + Lightness * B + Lightness * Hue + 
            Saturation * R + Saturation * G + Saturation * B + Saturation * Hue, data = train_dataset, family = binomial)

# F. Add categorical inputs to all main effect and all pairwise interactions of continuous inputs
modF <- glm(outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation, data = train_dataset, family = binomial)

# G. Interaction of the categorical inputs with all main effect and all pairwise interactions of continuous inputs
modG <- glm(outcome ~ (Lightness + Saturation) * (R + G + B + Hue)^2, data = train_dataset, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
library(splines)

# H. Model with non-linear basis functions based on EDA
# Using natural cubic splines for continuous variables
modH <- glm(outcome ~ ns(Hue, df = 3), data = train_dataset, family = binomial)

# I. Can consider interactions of basis functions with other basis functions!
modI <- glm(outcome ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset, family = binomial)

# Model 10: Model considering interactions of basis functions with other basis functions and categorical inputs
# Interaction of spline basis functions with LightnessNum
modJ <- glm(outcome ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness, data = train_dataset, family = binomial)
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Saving F and B for future use:

modF %>% readr::write_rds("modF.rds")
modB %>% readr::write_rds("modB.rds")

Did you experience any issues or warnings while fitting the generalized linear models?

Yes, it indicates that the logistic regression model fitted by glm() encountered numerical convergence issues. This warning often arises when the model is perfectly separating the classes based on the predictors, leading to predicted probabilities of 0 or 1 for some observations.

Which of the 10 models is the best? and what performance metric did you use to make your selection?

The solution provided below first defines a function extract_metrics() which is a wrapper to the broom::glance() function. An additional model name column is added to the broom::glance() results.

extract_metrics <- function(mod_object, mod_name)
{
  broom::glance(mod_object) %>% 
    mutate(model_name = mod_name)
}

The logistic regression model training set performance metrics are extracted for each of the 10 models fit in the previous problem.

glm_mle_results <- purrr::map2_dfr(list(modA, modB, modC, modD,
                                        modE, modF, modG, modH, modI, modJ),
                                   LETTERS[1:10],
                                   extract_metrics)

print(glm_mle_results)
## # A tibble: 10 × 9
##    null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##            <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
##  1          898.     834  -449.  900.  905.     898.         834   835
##  2          898.     834  -357.  740.  801.     714.         822   835
##  3          898.     834  -434.  878.  902.     868.         830   835
##  4          898.     834  -348.  731.  811.     697.         818   835
##  5          898.     834  -289.  707. 1014.     577.         770   835
##  6          898.     834  -315.  677.  786.     631.         812   835
##  7          898.     834  -147.  581. 1257.     295.         692   835
##  8          898.     834  -441.  890.  909.     882.         831   835
##  9          898.     834  -366.  758.  820.     732.         822   835
## 10          898.     834  -315.  770. 1101.     630.         765   835
## # ℹ 1 more variable: model_name <chr>

The AIC and BIC are visualized for the 10 models.

glm_mle_results %>% 
  select(model_name, AIC, BIC) %>% 
  pivot_longer(c(AIC, BIC)) %>% 
  ggplot(mapping = aes(x = model_name, y = value)) +
  geom_point(size = 2) +
  facet_wrap(~name, scales = 'free_y') +
  theme_bw()

Finding out the best model:

  • The performance metric used to make selection is AIC and BIC.
  • The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are both popular metrics used for model selection, particularly in the context of linear regression and generalized linear models like logistic regression.
  • Both AIC and BIC penalize model complexity to avoid overfitting while also rewarding models that provide a good fit to the data. They achieve this by balancing the goodness of fit (captured by the likelihood function) with a penalty term that increases as the number of parameters in the model increases.
  • Both AIC and BIC have desirable asymptotic properties, meaning that as the sample size increases, they tend to select the true model with high probability, provided the true model is among the candidate models being considered.
  • According to AIC the top models are: G, F, E.
  • According to BIC the top models are: F, B, D.
  • According to AIC the best model is: G
  • According to BIC the best model is: F
  • According to the figure above, the less conservative AIC identifies model G as the best. The more conservative BIC identifies the model F as the best.
  • G is the interaction of the categorical inputs with all main effect and all pairwise interactions of continuous inputs.
  • F is the addition categorical inputs to all main effect and all pairwise interactions of continuous inputs.

Model G: Best model AIC

modG$formula
## outcome ~ (Lightness + Saturation) * (R + G + B + Hue)^2

Model F: Best model BIC

modF$formula
## outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation

Model B: Best model BIC

modB$formula
## outcome ~ Lightness + Saturation

Model D: Best model BIC

modD$formula
## outcome ~ Lightness + Saturation + R + G + B + Hue

Visualize the coefficient summaries for your top 3 models according to BIC:

The coefficients associated with the 3 best models according to BIC are shown below.

modF %>% 
  coefplot::coefplot(intercept = FALSE) +
  theme_bw()

The statistically significant efficients in modF are: 1. Saturationsubdued 2. Saturation shaded 3. SaturationPure 4. Saturation neutral 5. SaturationGray 6. Lightness saturated 7. Lightness pale 8. lightness light 9. lightness midtone.

modB %>% 
  coefplot::coefplot(intercept = FALSE) +
  theme_bw()

The statistically significant efficients in modB are: 1. Saturationsubdued 2. Saturation shaded 3. SaturationPure 4. Saturation neutral 5. SaturationGray 6. Lightness saturated

modD %>% 
  coefplot::coefplot(intercept = FALSE) +
  theme_bw()

The statistically significant efficients in modD are: 1. Saturationsubdued 2. Saturation shaded 3. SaturationPure 4. Saturation neutral 5. SaturationGray

How do the coefficient summaries compare between the top 3?

Based on the coefficient summaries provided for the top 3 models according to BIC, we can make the following comparisons:

  1. Overlap in significant coefficients: Across all three models (modF, modB, and modD), we observe several common statistically significant coefficients related to Saturation and Lightness. Specifically, coefficients associated with various levels of Saturation (e.g., subdued, shaded, pure, neutral, and gray) and Lightness (e.g., saturated, pale, light, and midtone) appear to be significant in all three models.

  2. Additional significant coefficients in modF: Model modF includes additional significant coefficients compared to modB and modD. These additional coefficients might be related to the interactions between the categorical inputs (Saturation and Lightness) and continuous inputs (R, G, B, and Hue) as specified in the model.

  3. Differences in model complexity: Model modF appears to be the most complex among the top 3 models, as it includes interactions between categorical and continuous inputs, as well as additional pairwise interactions of continuous inputs. In contrast, modB and modD have simpler structures, with modD being the simplest.

  4. Consistency in significant coefficients: Despite differences in model complexity, there is consistency in the significant coefficients related to Saturation and Lightness across all three models. This suggests that these factors play a crucial role in predicting the outcome variable, regardless of the specific model structure.

Overall, while modF captures additional interactions between categorical and continuous inputs, the significant coefficients related to Saturation and Lightness remain consistent across all three models. This consistency highlights the importance of these factors in predicting the outcome variable and suggests that they should be considered in further analyses or predictive modeling efforts.

Which inputs seem important?

Based on the coefficient summaries and the consistency of statistically significant coefficients across the top 3 models according to BIC, it appears that the following inputs are important predictors of the outcome variable:

  1. Saturation: Various levels of Saturation, including subdued, shaded, pure, neutral, and gray, are consistently identified as significant predictors across all three models. This suggests that the degree of Saturation in the data significantly influences the outcome variable.

  2. Lightness: Different categories of Lightness, such as saturated, pale, light, and midtone, are also consistently identified as significant predictors in all three models. This indicates that the brightness or darkness of colors plays a crucial role in predicting the outcome.

  3. Interactions between Saturation, Lightness, and other continuous inputs: Model modF, which includes interactions between categorical inputs (Saturation and Lightness) and continuous inputs (R, G, B, and Hue), suggests that these interactions might capture additional predictive information. While not explicitly mentioned in the coefficient summaries, the presence of interaction terms in modF implies that the relationship between Saturation, Lightness, and other continuous inputs might be more nuanced and predictive than their main effects alone.

Conclusion: Saturation and Lightness, along with their potential interactions with other continuous inputs, emerge as important predictors of the outcome variable based on the coefficient summaries of the top 3 models. These findings highlight the significance of color characteristics in influencing the outcome and suggest that further exploration of these relationships could enhance predictive modeling efforts.

Part iii: Classification – iiiB) Bayesian GLM

Fitting Bayesian logistic regression models.

Now that you have gained insights into the relationships, it’s time to delve deeper into the uncertainty by employing Bayesian logistic regression models. In our lectures, we explored how the linear model framework can be extended to handle binary outcomes that are not continuous. This involves changing the likelihood distribution from Gaussian to Binomial and necessitates the use of a non-linear link function. Similar to ordinary linear models, the regression model is applied to a linear predictor that mimics the trend. In this Part, we will formulate the log-posterior function for logistic regression. By doing so, we will be able to directly compare and contrast this process with defining the log-posterior function for the linear model in the previous Part.

The complete probability model for logistic regression consists of the likelihood of the response \({y_n}\) given the event probability \({\mu_n}\), the inverse link function between the probability of the event, \({\mu_n}\) and the linear predictor, ηn, and the prior on all linear predictor model coefficients β. Assume that the β-parameters are a-priori independent Gaussians with a shared prior mean μβ and a shared prior standard deviation τβ.

Probability model for logistic regression:

The n-th observation’s linear predictor using the inner product of the n-th row of a design matrix xn,: and the unknown β-parameter column vector. You can assume that the number of unknown coefficients is equal to D+1. Fitting 10 Bayesian logistic regression models using the same linear predictor trend expressions that you used in the non-Bayesian logistic regression models.

Fit 2 Bayesian generalized linear models: what is chosen and why?

For this analysis, we have decided to select two models for Bayesian logistic regression: modF and modD. These models were chosen based on their performance in terms of BIC, a criterion that balances goodness of fit with model complexity.

modF includes all categorical inputs (Lightness and Saturation) along with all main effects and pairwise interactions of continuous inputs (R, G, B, and Hue). This model is advantageous because it captures potential interactions between the continuous variables, providing a more comprehensive understanding of their combined influence on the outcome. Additionally, including all pairwise interactions allows for a more flexible modeling approach, potentially capturing nonlinear relationships between the predictors and the outcome.

On the other hand, modD incorporates both categorical and continuous variables in a linear additive manner. While it may not capture interactions between continuous variables as explicitly as modF, it still provides a solid foundation for examining the independent effects of each predictor on the outcome. This model is particularly valuable for assessing the individual contributions of categorical and continuous variables to the logistic regression model.

By selecting these models, we aim to explore the impact of different model specifications on the Bayesian logistic regression framework, gaining insights into how the choice of predictors and their interactions influence model inference and prediction.

I revisited this stage after encountering issues with the my_laplace function due to infinite values generated by modF and modG. As a result, I’ve decided to proceed with models modD and modB, which rank among the top three best models based on BIC scores. I encountered difficulties when applying the my_laplace function to modF and modG because the calculated values for mu were either very large or very small, making the process impractical.

Define the design matrices for modF, modD and modB

Xmat_F <- model.matrix( modF$formula, data = train_dataset )
Xmat_D <- model.matrix( modD$formula, data = train_dataset )
Xmat_B <- model.matrix( modB$formula, data = train_dataset )

The coefficient names associated with modD and modB are shown below:

print("The coefficient names associated with modF:")
## [1] "The coefficient names associated with modF:"
modB %>% coef() %>% names()
##  [1] "(Intercept)"        "Lightnessdeep"      "Lightnesslight"    
##  [4] "Lightnessmidtone"   "Lightnesspale"      "Lightnesssaturated"
##  [7] "Lightnesssoft"      "Saturationgray"     "Saturationmuted"   
## [10] "Saturationneutral"  "Saturationpure"     "Saturationshaded"  
## [13] "Saturationsubdued"
print("The coefficient names associated with modD:")
## [1] "The coefficient names associated with modD:"
modD %>% coef() %>% names()
##  [1] "(Intercept)"        "Lightnessdeep"      "Lightnesslight"    
##  [4] "Lightnessmidtone"   "Lightnesspale"      "Lightnesssaturated"
##  [7] "Lightnesssoft"      "Saturationgray"     "Saturationmuted"   
## [10] "Saturationneutral"  "Saturationpure"     "Saturationshaded"  
## [13] "Saturationsubdued"  "R"                  "G"                 
## [16] "B"                  "Hue"

The column names associated with the Xmat_F and Xmat_D design matrix are displayed below: Same as above.

Xmat_B %>% colnames()
##  [1] "(Intercept)"        "Lightnessdeep"      "Lightnesslight"    
##  [4] "Lightnessmidtone"   "Lightnesspale"      "Lightnesssaturated"
##  [7] "Lightnesssoft"      "Saturationgray"     "Saturationmuted"   
## [10] "Saturationneutral"  "Saturationpure"     "Saturationshaded"  
## [13] "Saturationsubdued"
Xmat_D %>% colnames()
##  [1] "(Intercept)"        "Lightnessdeep"      "Lightnesslight"    
##  [4] "Lightnessmidtone"   "Lightnesspale"      "Lightnesssaturated"
##  [7] "Lightnesssoft"      "Saturationgray"     "Saturationmuted"   
## [10] "Saturationneutral"  "Saturationpure"     "Saturationshaded"  
## [13] "Saturationsubdued"  "R"                  "G"                 
## [16] "B"                  "Hue"

The code chunk below uses the all.equal() function to confirm the names are the same between the non-Bayesian models and the design matrices:

purrr::map2_lgl(purrr::map(list(modD,
                                modB),
                           ~names(coef(.))),
                purrr::map(list(Xmat_D,
                                Xmat_B),
                           colnames),
                all.equal)
## [1] TRUE TRUE

The log-posterior function you will program requires the design matrix, the observed output vector, and the prior specification. The lists adhere to the same naming convention as the design matrices. The info_F list pertains to the details regarding model F, whereas info_d pertains to the specifics regarding model D.The lists adhere to the same naming convention as the design matrices. The info_F list pertains to the details regarding model F, whereas info_d pertains to the specifics regarding model D.

info_F <- list(
  yobs = train_dataset$outcome,
  design_matrix = Xmat_F,
  mu_beta = 0,
  tau_beta = 4.5
)

info_D <- list(
  yobs = train_dataset$outcome,
  design_matrix = Xmat_D,
  mu_beta = 0,
  tau_beta = 4.5
)

info_B <- list(
  yobs = train_dataset$outcome,
  design_matrix = Xmat_B,
  mu_beta = 0,
  tau_beta = 4.5
)

Define the log-posterior function for logistic regression

Define the log-posterior function for logistic regression, logistic_logpost(). The first argument to logistic_logpost() is the vector of unknowns and the second argument is the list of required information.

Do you need to separate the β-parameters from the unknowns vector? No,The only unknowns to the logistic regression model are the regression coefficients! The unknowns vector therefore only consists of the β parameters!

logistic_logpost <- function(unknowns, my_info)
{
  # extract the design matrix and assign to X
  X <- my_info$design_matrix
  
  # calculate the linear predictor
  eta <- as.vector( X %*% as.matrix(unknowns))
  
  # calculate the event probability
  mu <- boot::inv.logit(eta)
  
  # evaluate the log-likelihood
  log_lik <- sum(dbinom(x = my_info$yobs,
                        size = 1, 
                        prob = mu,
                        log = TRUE))
  
  # evaluate the log-prior
  log_prior <- sum(dnorm(x = unknowns,
                         mean = my_info$mu_beta,
                         sd = my_info$tau_beta,
                         log = TRUE))
  
  # sum together
  log_lik + log_prior
}
logistic_logpost(rep(0, ncol(Xmat_D)), info_D)
## [1] -619.9692
logistic_logpost(rep(0, ncol(Xmat_B)), info_B)
## [1] -610.2771

The my_laplace() function is provided in the code chunk below:

my_laplace <- function(start_guess, logpost_func, ...)
{
  # code adapted from the `LearnBayes`` function `laplace()`
  fit <- optim(start_guess,
               logpost_func,
               gr = NULL,
               ...,
               method = "BFGS",
               hessian = TRUE,
               control = list(fnscale = -1, maxit = 5001))
  
  mode <- fit$par
  post_var_matrix <- -solve(fit$hessian)
  p <- length(mode)
  int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
  # package all of the results into a list
  list(mode = mode,
       var_matrix = post_var_matrix,
       log_evidence = int,
       converge = ifelse(fit$convergence == 0,
                         "YES", 
                         "NO"),
       iter_counts = as.numeric(fit$counts[1]))
}

Using my_laplace() to execute the Laplace Approximation for modF and modD.

Assign the results to the laplace_F through laplace_F objects. The names should be consistent with the design matrices and lists of required information. Thus, laplace_F must correspond to the info_F and Xmat_F objects.

The initial guess does not matter. Logistic regression has a single posterior mode! The optimization will converge as long as we use enough iterations.

laplace_D <- my_laplace(rep(0, ncol(Xmat_D)), logistic_logpost, info_D)
laplace_B <- my_laplace(rep(0, ncol(Xmat_B)), logistic_logpost, info_B)

The code chunk below confirms all models converged:

purrr::map_chr(list(laplace_B, laplace_D),
               'converge')
## [1] "YES" "YES"

The laplace_D, laplace_B object is the Bayesian version of modD, modB respectively that I fit previously.

saving best models

laplace_B %>% readr::write_rds("laplace_B.rds")

Identify the best model among modD and modB:

Let’s identify the best using the Evidence based approach! The code chunk below follows the procedure from an earlier assignment. The log-evidence values are extracted from each model’s laplace_ object. The posterior model weight is then calculated by undoing the log and dividing by the sum of the Evidence.

mod_log_evidences <- purrr::map_dbl(list(laplace_B, laplace_D),
                                     'log_evidence')

all_model_weights <- exp( mod_log_evidences ) / sum(exp(mod_log_evidences))

The posterior model weights are compiled into a dataframe. The bar chart below shows the model weights per model.

tibble::tibble(
  model_name = LETTERS[seq_along(all_model_weights)],
  post_model_weight = all_model_weights
) %>% 
  ggplot(mapping = aes(x = model_name, y = post_model_weight)) +
  geom_bar(stat = 'identity') +
  coord_cartesian(ylim = c(0,1)) +
  theme_bw()

For this application the log-Evidence based approach identifies the same model as the non-Bayesian BIC (modB was 2nd best model according to BIC for non-bayesian glm). The simpler model B is considered the best!

Visualize the regression coefficient posterior summary statistics:

To visualize the regression coefficient posterior summary statistics for the best model (model B), you can use the coefficient summary obtained from the Laplace approximation.

Plot showing the posterior mode of each coefficient along with the 95% confidence interval, based on the standard errors obtained from the Laplace approximation for model B.

# Extract the mode and variance matrix for the best model (model B)
mode_B <- laplace_B$mode
var_matrix_B <- laplace_B$var_matrix

# Calculate standard errors from the diagonal of the variance matrix
se_B <- sqrt(diag(var_matrix_B))

# Create a dataframe with coefficient names, mode, and standard errors
coef_summary_B <- data.frame(
  coefficient = colnames(Xmat_B),
  mode = mode_B,
  se = se_B
)

# Visualize the regression coefficient posterior summary statistics for model B
ggplot(coef_summary_B, aes(x = coefficient, y = mode)) +
  geom_point() +
  geom_errorbar(aes(ymin = mode - 1.96 * se, ymax = mode + 1.96 * se), width = 0.2) +
  coord_flip() +
  theme_bw() +
  labs(title = "Regression Coefficient Posterior Summary Statistics (Model B)")

The plot depicting the regression coefficient posterior summary statistics for model B closely resembles the coefficient summary statistics obtained for the non-Bayesian model modB. This similarity underscores the consistency between the Bayesian and non-Bayesian approaches in identifying the important predictors and estimating their effects on the outcome variable. Both analyses provide insights into the posterior mode and uncertainty (represented by the 95% confidence intervals) associated with each coefficient, facilitating a comprehensive understanding of the model’s predictive performance and explanatory power.

Part iii: Classification – iiiC) GLM Predictions

Examine the predictive trends of the models to better interpret their behavior. We will visualize the trends on a specifically designed prediction grid. We will use the same grid computed in the last part. The prediction grid is reloaded as viz_grid.

viz_grid %>% glimpse()
## Rows: 784
## Columns: 6
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…

The code chunk below starts the function for you and uses just two input arguments, mvn_result and num_samples. Adapt generate_lm_post_samples() and define generate_glm_post_samples(). We cannot directly use the generate_lm_post_samples() because it includes the back-transformation from φ to σ. The logistic regression model does NOT include σ. Since the only unknowns are the β -parameters, we can determine length_beta by just using the length of the posterior mode vector.

generate_glm_post_samples <- function(mvn_result, num_samples)
{
  # specify the number of unknown beta parameters
  length_beta <- length(mvn_result$mode)
  
  # generate the random samples
  beta_samples <- MASS::mvrnorm(n = num_samples,
                                mu = mvn_result$mode,
                                Sigma = mvn_result$var_matrix)
  
  # change the data type and name
  beta_samples %>% 
    as.data.frame() %>% tibble::as_tibble() %>% 
    purrr::set_names(sprintf("beta_%02d", (1:length_beta) - 1))
}

Define a function which calculates the posterior samples on the linear predictor and the event probability.

The function, post_logistic_pred_samples() is in the code chunk below. It consists of two input arguments Xnew and Bmat. Xnew is a test design matrix where rows correspond to prediction points. The matrix Bmat stores the posterior samples on the β -parameters, where each row is a posterior sample and each column is a parameter. Then, calculate the posterior smaples of the event probability.

post_logistic_pred_samples <- function(Xnew, Bmat)
{
  # calculate the linear predictor at all prediction points and posterior samples
  eta_mat <- Xnew %*% t(Bmat)
  
  # calculate the event probability
  mu_mat <- boot::inv.logit(eta_mat)
  
  # book keeping
  list(eta_mat = eta_mat, mu_mat = mu_mat)
}

The code chunk below defines a function summarize_logistic_pred_from_laplace() which manages the actions necessary to summarize posterior predictions of the event probability. The first argument, mvn_result, is the Laplace Approximation object. The second object is the test design matrix, Xtest, and the third argument, num_samples, is the number of posterior samples to make. This function generate posterior prediction samples of the linear predictor and the event probability, and summarizes the posterior predictions of the event probability.

generate_glm_post_samples(laplace_D, 784)
## # A tibble: 784 × 17
##    beta_00 beta_01 beta_02 beta_03 beta_04 beta_05 beta_06 beta_07 beta_08
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  -1.95  -0.121   1.34     1.08    2.49  -0.296    1.85     4.88  1.04  
##  2  -1.04  -0.497   0.0589   0.404   0.683 -0.395    0.736    4.03  1.46  
##  3   0.112 -0.559   2.01     1.80    2.58   0.329    2.11     3.11 -0.0119
##  4  -0.172 -0.0646  2.27     1.64    2.99   0.493    2.26     3.55  0.235 
##  5  -0.416 -0.588   1.55     1.41    1.83   0.0591   1.77     3.19  0.108 
##  6  -1.14  -0.977   0.416    0.339   0.815 -0.815    0.814    4.09  0.563 
##  7  -0.609  0.102   1.61     2.01    2.94   0.350    1.87     3.60  0.716 
##  8  -0.723 -0.371   2.87     1.78    3.22   0.278    2.48     4.19  1.34  
##  9  -0.867 -0.460   0.984    1.40    1.68  -0.216    2.08     4.42  0.206 
## 10  -1.92  -0.665  -0.155    0.626   1.17  -0.194    0.364    4.64  0.409 
## # ℹ 774 more rows
## # ℹ 8 more variables: beta_09 <dbl>, beta_10 <dbl>, beta_11 <dbl>,
## #   beta_12 <dbl>, beta_13 <dbl>, beta_14 <dbl>, beta_15 <dbl>, beta_16 <dbl>
summarize_logistic_pred_from_laplace <- function(mvn_result, Xtest, num_samples)
{
  # generate posterior samples of the beta parameters
  betas <- generate_glm_post_samples(mvn_result, num_samples)
  
  # data type conversion
  betas <- as.matrix(betas)
  
  # make posterior predictions on the test set
  pred_test <- post_logistic_pred_samples(Xtest, betas)
  
  # calculate summary statistics on the posterior predicted probability
  # summarize over the posterior samples
  
  # posterior mean, should you summarize along rows (rowMeans) or 
  # summarize down columns (colMeans) ???
  mu_avg <- rowMeans(pred_test$mu_mat)
  
  # posterior quantiles
  mu_q05 <- apply(pred_test$mu_mat, 1, stats::quantile, probs = 0.05)
  mu_q95 <- apply(pred_test$mu_mat, 1, stats::quantile, probs = 0.95)
  
  # book keeping
  tibble::tibble(
    mu_avg = mu_avg,
    mu_q05 = mu_q05,
    mu_q95 = mu_q95
  ) %>% 
    tibble::rowid_to_column("pred_id")
}

Define the vizualization grid design matrices consistent with the model B, model D formulas.

Create the design matrices using the viz_grid dataframe

Xviz_D <- model.matrix(~ Lightness + Saturation + R + G + B + Hue, data = viz_grid ) 
Xviz_B <- model.matrix(~ Lightness + Saturation, data = viz_grid )
Xviz_D %>% dim()
## [1] 784  17
Xviz_B %>% dim()
## [1] 784  13

Summarize the posterior predicted event probability associated with the three models on the visualization grid. The prediction summarizes should be executed in the code chunk below:

set.seed(8123) 
post_pred_summary_B <- summarize_logistic_pred_from_laplace(laplace_B, Xviz_B, 2500)
post_pred_summary_D <- summarize_logistic_pred_from_laplace(laplace_D, Xviz_D, 2500)

Print the dimensions of the objects to the screen:

post_pred_summary_B %>% dim()
## [1] 784   4
post_pred_summary_D %>% dim()
## [1] 784   4

The function creates a figure which visualizes the posterior predictive summary statistics of the event probability for a single model for R, G, B input variables.

these plot include predicted mean event probability and the confidence interval.

viz_bayes_logpost_preds <- function(post_pred_summary, input_df)
{
  post_pred_summary %>% 
    left_join(input_df %>% tibble::rowid_to_column('pred_id'),
              by = 'pred_id') %>% 
    mutate(B_range = cut(B, breaks = c(0, 50, 100, 150, 200, 255),
                         labels = c("0-50", "51-100", "101-150", "151-200", "201-255"),
                         include.lowest = TRUE)) %>%
    ggplot(mapping = aes(x = G, y = mu_avg)) +
    geom_point(mapping = aes(color = R), size = 2.0) +
    facet_wrap(~ B_range, labeller = 'label_both') +
    geom_ribbon(mapping = aes(ymin = mu_q05,
                              ymax = mu_q95,
                              group = interaction(R),
                              fill = R),
                alpha = 0.2) +
    labs(x = "G", y = "event probability") +
    theme_bw()
}

The function creates a figure which visualizes the posterior predictive summary statistics of the event probability for a single model for H, S, L input variables.

viz_bayes_logpost_preds_HSL <- function(post_pred_summary, input_df)
{
  post_pred_summary %>% 
    left_join(input_df %>% tibble::rowid_to_column('pred_id'),
              by = 'pred_id') %>% 
    ggplot(mapping = aes(x = Hue, y = mu_avg)) +
    geom_point(mapping = aes(color = Lightness), size = 2.0) +
    facet_wrap(~ Saturation, labeller = 'label_both') +
    geom_ribbon(mapping = aes(ymin = mu_q05,
                              ymax = mu_q95,
                              group = interaction(Lightness),
                              fill = Lightness),
                alpha = 0.2) +
    labs(x = "G", y = "event probability") +
    theme_bw()
}

Part iii: Classification – iiiD) Train/tune with resampling

Instructions: 1. You must train, assess, tune, and compare more complex methods via resampling. 2. You may use either caret or tidymodels to handle the preprocessing, training, testing, and evaluation.

Using caret to manage the training, assessment, and tuning of the glmnet elastic net penalized linear regression model.

The code chunk below imports the caret package

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

The caret package prefers the binary outcome to be organized as a factor data type compared to an integer. In the Data Exploration part of the project, the categorical values in Lightness and Saturation Columns were reformatted to have integers. So, all the input variables like R, G, B, Hue, LightnessNum and SaturationNum are reformatted to be consistent with the caret preferred structure.

train_dataset %>% glimpse()
## Rows: 835
## Columns: 15
## $ R             <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 14…
## $ G             <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, …
## $ B             <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132,…
## $ Lightness     <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", …
## $ Saturation    <chr> "bright", "bright", "bright", "bright", "bright", "brigh…
## $ Hue           <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 1…
## $ response      <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, …
## $ outcome       <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ LightnessNum  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SaturationNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ y             <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.09074…
## $ logit_R       <dbl> 0.72865387, -2.17562547, 0.72865387, -2.09274551, 0.6931…
## $ logit_G       <dbl> -Inf, -1.62924054, -1.40691365, -1.66965677, -3.08534443…
## $ logit_B       <dbl> -2.4624338, 0.1784828, -2.7950616, 0.1996131, -2.7950616…
## $ logit_Hue     <dbl> -2.36712361, 1.79175947, -1.38629436, 2.04769284, -2.047…

Train and tune the following models:

Generalized Linear models:

  1. D. All categlmical and continuous variables – linear additive
  • modD <- glm(outcome ~ Lightness + Saturation + R + G + B + Hue, data = train_dataset, family = binomial)
  1. F. Add categorical inputs to all main effect and all pairwise interactions of continuous inputs
  • modF <- glm(outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation, data = train_dataset, family = binomial)
  1. The 2 models selected from iiA: they are already chosen so I will be choosing the other 2 within the top models according to BIC.
  • B. Categoriglm variables only – linear additive

  • modB <- glm(outcome ~ Lightness + Saturation, data = train_dataset, family = binomial)

  • Can consider interactions of basis functions with other basis functions.

  • modI <- glm(outcome ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset, family = binomial)

Regularized regression with Elastic net

  1. Add categorical inputs to all main effect and all pairwise interactions of continuous inputs:
  • modF <- glm(outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation, data = train_dataset, family = binomial)
  1. The more complex of the 2 models selected from iiA, it is already chosen
  • D. All categlmical and continuous variables – linear additive
  • modD <- glm(outcome ~ Lightness + Saturation + R + G + B + Hue, data = train_dataset, family = binomial)
Nueral Network, Gradient Boosted Tree Model, Randon forest, GAM, SVN.
Loading Libraries:
library(caret)

The caret package prefers the binary outcome to be organized as a factor data type compared to an integer. The data set is reformatted for you in the code chunk below. The binary outcome y is converted to a new variable outcome with values ‘event’ and ‘non_event’. The first level is forced to be ‘event’ to be consistent with the caret preferred structure.

# Assuming the outcome column in train_dataset is named "outcome_column"
df_caret <- train_dataset %>%
  mutate(outcome = ifelse(outcome == 1, 'popular', 'unpopular')) %>%
  mutate(outcome = factor(outcome, levels = c("popular", "unpopular"))) %>%
  select(R, G, B, Hue, Lightness, Saturation, outcome)

glimpse(df_caret)
## Rows: 835
## Columns: 7
## $ R          <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 144, …
## $ G          <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, 163…
## $ B          <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132, 50…
## $ Hue        <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 12, …
## $ Lightness  <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", "da…
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "bright",…
## $ outcome    <fct> popular, popular, popular, unpopular, unpopular, unpopular,…

Specify the resampling scheme to be 10 fold with 3 repeats. Assign the result of the trainControl() function to the my_ctrl object. Specify the primary performance metric to be ‘Accuracy’ and assign that to the my_metric object.

my_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)

my_metric <- "Accuracy"

I will start by training, assessing, and tuning an elastic net model using the default caret tuning grid. For this, I’ll use the caret::train() function with the formula interface, specifying a model consistent with model H. This model should interact the categorical input with the linear main continuous effects, interaction between continuous, and quadratic continuous features. I need to ensure the binary outcome is named ‘outcome’ instead of ‘y’ in the formula. I’ll set the method argument to ‘glmnet’ and the metric argument to my_metric. Although the inputs were standardized, I’ll instruct caret to standardize the features by setting the preProcess argument to c(‘center’, ‘scale’). This will allow me to practice standardizing inputs. Additionally, I’ll assign the trControl argument to the my_ctrl object.

It’s important to note that even though I’m using glmnet to fit the model, caret does not require me to organize the design matrix as required by glmnet. Therefore, I don’t need to remove the intercept when defining the formula to caret::train().

After training, assessing, and tuning the glmnet elastic net model, I’ll assign the result to the enet_default object and display the result to the screen. Then, I’ll analyze which tuning parameter combinations are considered the best. I’ll also determine whether the best set of tuning parameters is more consistent with Lasso or Ridge regression.

The random seed is reset before each call to train(). This is critical when using caret to train multiple models on the same data. We want each model to use the same resample splits. Setting the seed before each call to train() makes sure that is indeed the case. If we do not set the seed the resample fold training sets and test sets will be different! We will therefore not compare the models on the exact same data.

Train and assessing using cross validation ‘accuracy’ metric: 1. D. All categlmical and continuous variables – linear additive

set.seed(2001)
# Train and tune the model using caret::train()
modD_default <- train(
  outcome ~ Lightness + Saturation + R + G + B + Hue,                   
  data = df_caret,            
  method = "glm",                  
  family = binomial,               
  preProcess = c("center", "scale"),  
  trControl = my_ctrl              
)
modD_default
## Generalized Linear Model 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (16), scaled (16) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8195765  0.3733741

Train and assessing using cross validation ‘accuracy’ metric: Add categorical inputs to all main effect and all pairwise interactions of continuous inputs

set.seed(2001)
# Train and tune the model using caret::train()
modF_default <- train(
  outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation,                   
  data = df_caret,            
  method = "glm",                  
  family = binomial,               
  preProcess = c("center", "scale"),  
  trControl = my_ctrl              
)
modF_default
## Generalized Linear Model 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8132078  0.3887981

In terms of accuracy, modD is more accurate than modF

Train and assessing using cross validation ‘accuracy’ metric: Categorical variables only – linear additive

set.seed(2001)
# Train and tune the model using caret::train()
modB_default <- train(
  outcome ~ Lightness + Saturation,                   
  data = df_caret,            
  method = "glm",                  
  family = binomial,               
  preProcess = c("center", "scale"),  
  trControl = my_ctrl              
)
modB_default
## Generalized Linear Model 
## 
## 835 samples
##   2 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ... 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.8227894  0.370362

The accuracy for this model is higher than modF and modD.

modB_default %>% readr::write_rds("modB_default.rds")

Train and assessing using cross validation ‘accuracy’ metric: Interactions of basis functions with other basis functions.

set.seed(2001)
# Train and tune the model using caret::train()
modI_default <- train(
  outcome ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3),                   
  data = df_caret,            
  method = "glm",                  
  family = binomial,               
  preProcess = c("center", "scale"),  
  trControl = my_ctrl              
)
modI_default
## Generalized Linear Model 
## 
## 835 samples
##   4 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7948533  0.2758227

The accuracy for this model is lesser than all the models explored so far.

Regularized regression with Elastic net

Train, assess, and tune the glmnet elastic net model with the defined resampling scheme.

  1. F. Add categorical inputs to all main effect and all pairwise interactions of continuous inputs

    modF <- glm(outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation, data = train_dataset, family = binomial)
set.seed(2001)

modF_glmn_default <- train( outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation,
                       data = df_caret,
                       method = 'glmnet',
                       metric = my_metric,
                       preProcess = c("center", "scale"),
                       trControl = my_ctrl)

modF_glmn_default
## glmnet 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa    
##   0.10   0.0003523518  0.8255860  0.4025064
##   0.10   0.0035235183  0.8239798  0.3821878
##   0.10   0.0352351830  0.8227894  0.3703620
##   0.55   0.0003523518  0.8231764  0.4002840
##   0.55   0.0035235183  0.8231766  0.3785039
##   0.55   0.0352351830  0.8227894  0.3703620
##   1.00   0.0003523518  0.8176016  0.3909140
##   1.00   0.0035235183  0.8239798  0.3794697
##   1.00   0.0352351830  0.8227894  0.3703620
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.0003523518.

Train, assess, and tune the glmnet elastic net model with the defined resampling scheme.

  1. The more complex from previous part is F, which is already trained so I’m taking the other from the previous part, which is modB.
set.seed(2001)

modB_glmn_default <- train( outcome ~ Lightness + Saturation,
                       data = df_caret,
                       method = 'glmnet',
                       metric = my_metric,
                       preProcess = c("center", "scale"),
                       trControl = my_ctrl)

modB_glmn_default
## glmnet 
## 
## 835 samples
##   2 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        Accuracy   Kappa   
##   0.10   0.0003523518  0.8227894  0.370362
##   0.10   0.0035235183  0.8227894  0.370362
##   0.10   0.0352351830  0.8227894  0.370362
##   0.55   0.0003523518  0.8227894  0.370362
##   0.55   0.0035235183  0.8227894  0.370362
##   0.55   0.0352351830  0.8227894  0.370362
##   1.00   0.0003523518  0.8227894  0.370362
##   1.00   0.0035235183  0.8227894  0.370362
##   1.00   0.0352351830  0.8227894  0.370362
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.03523518.

Among the two trained models, modF has slightly higher accuracy than modB.

Neural Network:

library(caret)
df3_caret <- train_dataset %>% 
  mutate(outcome = ifelse(outcome == 1, 'popular', 'unpopular')) %>% 
  mutate(outcome = factor(outcome, levels = c("popular", "unpopular"))) %>% 
  select(R, G, B, Hue, Saturation, Lightness, outcome)

df3_caret %>% glimpse()
## Rows: 835
## Columns: 7
## $ R          <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 144, …
## $ G          <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, 163…
## $ B          <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132, 50…
## $ Hue        <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 12, …
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "bright",…
## $ Lightness  <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", "da…
## $ outcome    <fct> popular, popular, popular, unpopular, unpopular, unpopular,…

Training a neural network via the nnet package

Training a neural network to classify the binary outcome, outcome, with respect to all inputs. Since the neural network will attempt to create non-linear relationships, I’m not specifying interaction between the inputs.

The df_caret dataframe only consists of the inputs and the binary outcome. Thus, we can create a formula for all inputs via the dot operator, outcome ~ . The dot operator streamlines the formula interface when the dataframe ONLY consists of inputs and the output.

Train, assess, and tune the nnet neural network with the defined resampling scheme. Assign the result to the nnet_default object and print the result to the screen.

my_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)

my_metric_p3 <- "Accuracy"
set.seed(1234)

nnet_default <- train( outcome ~ .,
                       data = df3_caret,
                       method = 'nnet',
                       metric = my_metric_p3,
                       preProcess = c("center", "scale"),
                       trControl = my_ctrl,
                       trace = FALSE)

nnet_default
## Neural Network 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (16), scaled (16) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.7864653  0.2034894
##   1     1e-04  0.7940863  0.2150129
##   1     1e-01  0.8176093  0.3743426
##   3     0e+00  0.8092469  0.4158233
##   3     1e-04  0.8076219  0.4173052
##   3     1e-01  0.8331720  0.4839001
##   5     0e+00  0.8036106  0.4110662
##   5     1e-04  0.7984761  0.4180963
##   5     1e-01  0.8315325  0.4867369
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 3 and decay = 0.1.

The accuracy for default Neural network is for size = 3: decay: 1e-01 accuracy: 0.8331720 Kappa: 0.4839001 As shown by the above display, the best neural network has size = 3 and decay = 0.1. The best neural network therefore consists of 3 hidden units. This is not a large neural network. We would want to consider trying more hidden units to see if the results can be improved further.

plot(nnet_default, xTrans=log)

The code chunk below defines a custom tuning grid focused on tuning the decay parameter. The decay parameter is just the regularization strength and thus is similar to lambda used in elastic net with the ridge penalty!

nnet_grid <- expand.grid(size = c(5, 10, 20),
                         decay = exp(seq(-6, 2, length.out=11)))

nnet_grid %>% dim()
## [1] 33  2

The more refined neural network tuning grid is used below.

set.seed(1234)

nnet_tune <- train( outcome ~ .,
                    data = df3_caret,
                    method = 'nnet',
                    metric = my_metric_p3,
                    tuneGrid = nnet_grid,
                    preProcess = c("center", "scale"),
                    trControl = my_ctrl,
                    trace = FALSE)

nnet_tune$bestTune
##    size     decay
## 28   20 0.1353353
nnet_tune
## Neural Network 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (16), scaled (16) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   size  decay        Accuracy   Kappa    
##    5    0.002478752  0.8196174  0.4603789
##    5    0.005516564  0.8036253  0.4067763
##    5    0.012277340  0.8143683  0.4501354
##    5    0.027323722  0.8279227  0.4905871
##    5    0.060810063  0.8244321  0.4762968
##    5    0.135335283  0.8255656  0.4638801
##    5    0.301194212  0.8320104  0.4724437
##    5    0.670320046  0.8275731  0.4330740
##    5    1.491824698  0.8212094  0.3688582
##    5    3.320116923  0.8212094  0.3656789
##    5    7.389056099  0.7848542  0.1359134
##   10    0.002478752  0.8039928  0.4392154
##   10    0.005516564  0.8055900  0.4497285
##   10    0.012277340  0.8168157  0.4773341
##   10    0.027323722  0.8223621  0.4889601
##   10    0.060810063  0.8248145  0.4917687
##   10    0.135335283  0.8339943  0.5054810
##   10    0.301194212  0.8292083  0.4742128
##   10    0.670320046  0.8291798  0.4444163
##   10    1.491824698  0.8211998  0.3704140
##   10    3.320116923  0.8220078  0.3685276
##   10    7.389056099  0.8076216  0.2819806
##   20    0.002478752  0.7980069  0.4371546
##   20    0.005516564  0.8136648  0.4695291
##   20    0.012277340  0.8160224  0.4779194
##   20    0.027323722  0.8212383  0.4886769
##   20    0.060810063  0.8255842  0.4970033
##   20    0.135335283  0.8363562  0.5176417
##   20    0.301194212  0.8319720  0.4806440
##   20    0.670320046  0.8303844  0.4448305
##   20    1.491824698  0.8211998  0.3684223
##   20    3.320116923  0.8224094  0.3695368
##   20    7.389056099  0.8167919  0.3353003
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 20 and decay = 0.1353353.

The accuracy for default Neural network is for size = 3: decay: 1e-01 accuracy: 0.8331720 Kappa: 0.4839001 The accuracy for Tuned Neural network is for size = 20: decay: 0.135335283 accuracy: 0.8363562 kappa: 0.5176417 So, Tuned neural network has better performance than default. The tuning results are visualized below for the refined larger neural network. By default caret identifies whichever model has the overall best performance. The plot below reveals there is very little difference between the smaller neural network with size 20 hidden units compared to the overall best with 3 hidden units.

plot(nnet_tune, xTrans=log)

The tuning results are visualized below for the refined larger neural network. By default caret identifies whichever model has the overall best performance. The plot below reveals there is very little difference between the smaller neural network with 5 hidden units compared to the overall best with 20 hidden units. I would prefer the less complex neural network in this case. The figure also reveals how the very low regularization strength on the left side of the plot and the very high regularization strength on the ridge side of the plot has worse performance compared to the “middle ground” with moderate regularization strength. The figure below is revealing overfit complex models on the left side and the underfit overly simple models on the right hand side

Use predictions to understand the behavior of the neural network.

The pred_viz_nnet_probs dataframe is column binded to the viz_grid dataframe. The new object, viz_nnet_df, provides the class predicted probabilities for each input combination in the visualization grid.

pred_viz_nnet_probs <- predict( nnet_default, newdata = viz_grid, type = 'prob' )

viz_nnet_df <- viz_grid %>% bind_cols(pred_viz_nnet_probs)

viz_nnet_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 2.318684e-02, 9.123897e-02, 4.223825e-02, 2.645848e-01, 8.9…
## $ unpopular  <dbl> 0.97681316, 0.90876103, 0.95776175, 0.73541515, 0.10568882,…

The glimpse reveals that the event column stores the predicted event probability. Then visualize the predicted event probability in a manner consistent with the viz_bayes_logpost_preds() function and the tuned elastic net model predictions. Visualize the predicted probability as a line (curve) with respect to G, for each combination of B and R.

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_nnet_df$B_range <- cut(viz_nnet_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)

Visualize the predicted event probability of RGB:

viz_nnet_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R), size = 2.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  theme_bw()

Motivation behind the graph (RGB):

I preferred analysis outcome (popularity) by RGB and HSL color models. By visual interpretation of the outcome predictions, I aimed to identify color combinations within the RGB color model that exhibit a high level of popularity. This information could be valuable for PPG company in their efforts to create new colors.

By studying the combinations of red, green, and blue values that are associated with higher popularity, PPG can potentially synthesize new colors that are more likely to resonate with consumers.

Interpretations of the RGB graph suggests that colors with higher probability values of popularity have:

  1. R: 0 to 100; G: 0 to 100; B: 101 to 255 - #000065 to #6464ff - Very dark blue to Light blue.

  1. R: 0 to 150; G: 0 to 100; B: 151 to 200; - #009600 - #9664c8
  2. R: 101 to 150; G:101 to 150; B: 0 to 50 - #656500 - #969632

Visualize the predicted event probability of HSL:

viz_nnet_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 2.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

#### Interpretations of the HSL graph suggests that colors with higher probability values of popularity have: 1. H: 15 to 30; S: Bright; L: dark/soft/pale 2. H: 0 to 15; S: neural; L: saturated, pale, midtone, soft, light

Random forest:

Let’s now a tree based method. You will use the default tuning grid. Tree based models do not have the same kind of preprocessing requirements as other models. Train a random forest binary classifier by setting the method argument equal to “rf”. You must set importance = TRUE in the caret::train() function call. Assign the result to the rf_default variable. Display the rf_default object to the screen.

The random forest model is trained and tuned below. The formula interface uses the . operator to streamline selecting all inputs in the df_caret dataframe.

set.seed(1234)

rf_default <- train( outcome ~ .,
                     data = df3_caret,
                     method = 'rf',
                     metric = my_metric_p3,
                     trControl = my_ctrl,
                     importance = TRUE)

rf_default
## Random Forest 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8259860  0.3792716
##    9    0.8546397  0.5585412
##   16    0.8578284  0.5708417
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.

The accuracy for default Random Forest is 0.8578284 and kappa = 0.5708417 for mtry = 16

Random forest behavior through predictions.

Let’s make predictions on the visualization grid, viz_grid, using the random forest model rf_default. Instruct thepredict()function to return the probabilities by setting type = ‘prob’.

pred_viz_rf_probs <- predict( rf_default, newdata = viz_grid, type = 'prob' )

The pred_viz_rf_probs dataframe is column binded to the viz_grid dataframe. The new object, viz_rf_df, provides the class predicted probabilities for each input combination in the visualization grid according to the random forest model.

viz_rf_df <- viz_grid %>% bind_cols(pred_viz_rf_probs)

viz_rf_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.030, 0.034, 0.012, 0.130, 0.758, 0.252, 0.678, 0.202, 0.3…
## $ unpopular  <dbl> 0.970, 0.966, 0.988, 0.870, 0.242, 0.748, 0.322, 0.798, 0.6…

The glimpse reveals that the popular column stores the predicted event probability.

Visualize the predicted event probability of RBG:

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_rf_df$R_range <- cut(viz_rf_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df$G_range <- cut(viz_rf_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df$B_range <- cut(viz_rf_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R),
            size = 1.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  geom_density_2d() +
  theme_bw()

#### Interpretations of the RGB graph suggests that colors with higher probability values of popularity have: 1. R:0 to 100 G: 0 to 100 B:0 to 50 2. R:0 to 200 G: 0 to 50 B: 51 to 100 3. R:0 to 200 G: 0 to 75 B: 101-150 4. R: 0 to 255 G: 0 to 50 B:151 to 200 5. R: 0 to 255 G: 0 to 50 B:201 to 255

Visualize the predicted event probability of HSL:

viz_rf_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 1.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()

The predictions generated by the random forest model appear to be fluctuating abruptly and inconsistently. This is because random forest is a tree based method. Decision trees do not produce smooth predictions.

Tuning Random Forests: I could not comprehend default tuning grid without tuneGrid so I defined the grid.

# Define the grid of hyperparameters
rf_grid <- expand.grid(
  mtry = c(2, 4, 6),  
  nodesize = c(5, 10, 20) 
)

# Check the dimensions of the grid
rf_grid <- as.data.frame(rf_grid)

Tuning the random forest:

# Determine the number of predictor variables
num_predictors <- ncol(df_caret) - 1

# Train the random forest model
rf_tune <- train(outcome ~ R + G + B + Hue + Saturation + Lightness,
                 data = df3_caret,
                 method = 'rf', 
                 metric = my_metric_p3,
                 tuneGrid = expand.grid(.mtry = seq(1, num_predictors)),
                 preProcess = c("center", "scale"),
                 trControl = my_ctrl,
                 trace = FALSE)

# Print the best hyperparameters
rf_tune
## Random Forest 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## Pre-processing: centered (16), scaled (16) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 752, 752, 751, 751, 751, 751, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.7712590  0.0000000
##   2     0.8259636  0.3735576
##   3     0.8467233  0.4925533
##   4     0.8535362  0.5345724
##   5     0.8579586  0.5571733
##   6     0.8583507  0.5599688
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
  1. The accuracy for default Random Forest is 0.8578284 and kappa = 0.5708417 for mtry = 16
  2. The accuracy for tuned Random Forest is 0.8583507 and kappa = 0.5599688 for mtry = 6 So, the accuracy for tuned is slightly better than the default.

Based on the accuracy metric, the tuned random forest model appears to perform better than the default random forest model.

plot(rf_default, xTrans=log)

plot(rf_tune, xTrans=log)

Tuned random forest behavior through predictions.

pred_viz_rft_probs <- predict( rf_tune, newdata = viz_grid, type = 'prob' )

The pred_viz_rft_probs dataframe is column binded to the viz_grid dataframe. The new object, viz_rft_df, provides the class predicted probabilities for each input combination in the visualization grid according to the random forest model.

viz_rft_df <- viz_grid %>% bind_cols(pred_viz_rft_probs)

viz_rft_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.060, 0.130, 0.058, 0.122, 0.474, 0.208, 0.456, 0.170, 0.4…
## $ unpopular  <dbl> 0.940, 0.870, 0.942, 0.878, 0.526, 0.792, 0.544, 0.830, 0.5…

The glimpse reveals that the event column stores the predicted event probability for tuned Random forest.

Visualize the predicted event probability of RBG for tuned Random forest:

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_rft_df$R_range <- cut(viz_rf_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df$G_range <- cut(viz_rf_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df$B_range <- cut(viz_rf_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R),
            size = 1.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  geom_density_2d() +
  theme_bw()

Visualize the predicted event probability of HSL:

viz_rft_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 2.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()

Gradient Boosted Tree Model

The Gradient Boosted Tree model is trained and tuned below.

# Set the seed for reproducibility
set.seed(1234)
# Train the Gradient Boosted Trees model
gbm_default <- train(outcome ~ .,
                     data = df3_caret,
                     method = 'gbm',
                     metric = "Accuracy",
                     trControl = my_ctrl,
                     verbose = FALSE)

# Print the default GBM model
gbm_default$bestTune
##   n.trees interaction.depth shrinkage n.minobsinnode
## 8     100                 3       0.1             10

Accuracy for GBM:

gbm_default
## Stochastic Gradient Boosting 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8240157  0.3836788
##   1                  100      0.8279700  0.4048482
##   1                  150      0.8283810  0.4087492
##   2                   50      0.8279748  0.4085514
##   2                  100      0.8411756  0.4747118
##   2                  150      0.8475298  0.5090054
##   3                   50      0.8391577  0.4599565
##   3                  100      0.8554810  0.5376428
##   3                  150      0.8554809  0.5493598
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.

The accuracy for default GBM is 0.8554810, the final values used for the model were n.trees = 100, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Examine the Gradient Boosted Trees behavior through predictions.

pred_viz_gbm_probs <- predict( gbm_default, newdata = viz_grid, type = 'prob' )

viz_gbm_df <- viz_grid %>% bind_cols(pred_viz_gbm_probs)

viz_gbm_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.04296871, 0.16840976, 0.04159066, 0.03144592, 0.72002648,…
## $ unpopular  <dbl> 0.95703129, 0.83159024, 0.95840934, 0.96855408, 0.27997352,…

Visualize the predicted event probability for Gradient Boosted Trees: RGB

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_gbm_df$R_range <- cut(viz_gbm_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df$G_range <- cut(viz_gbm_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df$B_range <- cut(viz_gbm_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)

viz_gbm_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R),
            size = 1.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  geom_density_2d() +
  theme_bw()

Visualize the predicted event probability for Gradient Boosted Trees: HSL

viz_gbm_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 1.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()

Tuning and Training the Gradient Boosted Trees model

# Train and tune the Gradient Boosted Trees model directly in the train function
gbm_tune <- train(outcome ~ .,
                  data = df3_caret,
                  method = 'gbm',
                  metric = "Accuracy",  # Use accuracy as the evaluation metric
                  trControl = my_ctrl,
                  tuneGrid = expand.grid(.n.trees = c(100, 200, 300),
                                         .interaction.depth = seq(1, num_predictors),
                                         .shrinkage = c(0.01, 0.1, 0.3),
                                         .n.minobsinnode = c(5, 10, 20)),
                  verbose = FALSE)
gbm_tune$bestTune
##     n.trees interaction.depth shrinkage n.minobsinnode
## 101     200                 6       0.1              5

Accuracy results for gbm_tune:

The accuracy for default GBM is 0.8554810, the final values used for the model were n.trees = 100, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. Tuned Accuracy was used to select the optimal model using the largest value - 0.8599150 and kappa - 0.5796735 The final values used for the model were n.trees = 200, interaction.depth = 6, shrinkage = 0.1 and n.minobsinnode = 5. Slightly greater than default.

Plotting Gradient Boosted Tree default and tuned model

plot(gbm_default, xTrans=log)

plot(gbm_tune, xTrans=log)

Tuned Gradient Boosted Trees model through predictions.

pred_viz_gbmt_probs <- predict( gbm_tune, newdata = viz_grid, type = 'prob' )

The pred_viz_gbmt_probs dataframe is column binded to the viz_grid dataframe. The new object, viz_gbm_df, provides the class predicted probabilities for each input combination in the visualization grid according to the random forest model.

viz_gbmt_df <- viz_grid %>% bind_cols(pred_viz_gbmt_probs)

viz_gbmt_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.042555821, 0.047380040, 0.056188569, 0.004684948, 0.31366…
## $ unpopular  <dbl> 0.95744418, 0.95261996, 0.94381143, 0.99531505, 0.68633079,…

The glimpse reveals that the event column stores the predicted event probability for tuned Gradient Boosting Tree model.

Visualize the predicted event probability of RBG for tuned Gradient Boosted Tree Model:

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_gbmt_df$R_range <- cut(viz_gbmt_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df$G_range <- cut(viz_gbmt_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df$B_range <- cut(viz_gbmt_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R),
            size = 1.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  geom_density_2d() +
  theme_bw()

Visualize the predicted event probability for Gradient Boosted Trees: HSL

viz_gbmt_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 1.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()

Generalized Additive Models (GAM) with caret:

# Set seed for reproducibility
set.seed(1234)

gam_model <- train(
  outcome ~ .,
  data = df3_caret,
  method = "gam",
  metric = "Accuracy",
  trControl = my_ctrl,
  tuneGrid = expand.grid(
    select = c(0.1, 0.2, 0.3),  # Specify the smoothing parameter values to tune
    method = "GCV.Cp"            # Specify the method for tuning the smoothing parameter
  )
)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Print the best hyperparameters
gam_model$bestTune
##   select method
## 1    0.1 GCV.Cp

The output you provided indicates that the best hyperparameters chosen by the tuning process are:

select: 0.1 method: GCV.Cp Here’s what each of these means:

select: This corresponds to the smoothing parameter value chosen for the GAM model. Smoothing is a technique used in GAMs to deal with non-linear relationships between predictors and the outcome. A smaller select value indicates less smoothing, allowing the model to capture more complex patterns in the data. In this case, the tuning process has found that a select value of 0.1 resulted in the best performance based on the chosen metric (in this case, accuracy). method: This specifies the method used for tuning the smoothing parameter. In GAMs, there are different methods available for selecting the optimal smoothing parameter value. Common methods include generalized cross-validation (GCV) and the Cp criterion. Here, the tuning process used the GCV.Cp method.

pred_viz_gam_probs <- predict( gam_model, newdata = viz_grid, type = 'prob' )
viz_gam_df <- viz_grid %>% bind_cols(pred_viz_gam_probs)

viz_gam_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.0032961452, 0.1341051392, 0.0026029970, 0.0003317157, 0.9…
## $ unpopular  <dbl> 9.967039e-01, 8.658949e-01, 9.973970e-01, 9.996683e-01, 1.4…

The glimpse reveals that the event column stores the predicted event probability. Then visualize the predicted event probability in a manner consistent with the viz_bayes_logpost_preds() function and the tuned elastic net model predictions. Visualize the predicted probability as a line (curve) with respect to G, for each combination of B and R.

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_gam_df$R_range <- cut(viz_gam_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df$G_range <- cut(viz_gam_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df$B_range <- cut(viz_gam_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)

Visualize the predicted event probability of RGB:

viz_gam_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R), size = 2.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  geom_density_2d() +  # Add contour lines to represent point density
  theme_bw()

Visualize the predicted event probability of HSL

viz_gam_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 2.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()

Tuning Generalized Additive Models (GAM):

tuning_results <- gam_model$results
print(tuning_results)
##   select method  Accuracy     Kappa AccuracySD   KappaSD
## 1    0.1 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 2    0.2 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 3    0.3 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
# Plot tuning results
ggplot(tuning_results, aes(x = select, y = Accuracy)) +
  geom_line() +   # Line plot
  geom_point() +  # Add points
  labs(x = "Smoothing Parameter Value (select)", y = "Accuracy") +
  ggtitle("Tuning Results for Generalized Additive Model (GAM)") +
  theme_minimal()

Summary: - Accuracy: 0.8622984 - Kappa: 0.5755467 - Accuracy Standard Deviation (SD): 0.03498198 - Kappa Standard Deviation (SD): 0.1087822

Since all three variants have identical accuracy and kappa values, there is no difference in performance among them based on these metrics.

Tuning the more refined GAM grid is used below.

# Set seed for reproducibility
set.seed(1234)

gam_tune <- train(
  outcome ~ .,
  data = df3_caret,
  method = "gam",
  metric = "Accuracy",
  trControl = my_ctrl,
  tuneGrid = expand.grid(
    select = c(0.3, 3, 6),  # Specify the smoothing parameter values to tune
    method = "GCV.Cp"            # Specify the method for tuning the smoothing parameter
  )
)
# Print the best hyperparameters
gam_tune$results
##   select method  Accuracy     Kappa AccuracySD   KappaSD
## 1    0.3 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 2    3.0 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 3    6.0 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822

Based on the provided result for the GAM model, it seems that all three variants of the model (with smoothing parameter values of 0.3, 3.0, and 6.0) have the same accuracy and kappa values. Here’s a breakdown of the metrics:

  • Accuracy: 0.8622984
  • Kappa: 0.5755467
  • Accuracy Standard Deviation (SD): 0.03498198
  • Kappa Standard Deviation (SD): 0.1087822

Since all three variants have identical accuracy and kappa values, there is no difference in performance among them based on these metrics. It’s possible that the smoothing parameter values do not significantly affect the model’s performance in terms of accuracy and kappa.

The accuracy for GAM tuned model for 0.8622984. The accuracy is same before and after tuning.

pred_viz_gamt_probs <- predict( gam_tune, newdata = viz_grid, type = 'prob' )
viz_gamt_df <- viz_grid %>% bind_cols(pred_viz_gamt_probs)

viz_gamt_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.0032961452, 0.1341051392, 0.0026029970, 0.0003317157, 0.9…
## $ unpopular  <dbl> 9.967039e-01, 8.658949e-01, 9.973970e-01, 9.996683e-01, 1.4…

Use predictions to understand the behavior of the GAM

# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)

# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")

# Add breaks and labels to the B variable
viz_gamt_df$R_range <- cut(viz_gamt_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df$G_range <- cut(viz_gamt_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df$B_range <- cut(viz_gamt_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)

Visualize the predicted event probability of RGB:

viz_gamt_df %>% 
  ggplot(mapping = aes(x = G, y = popular)) +
  geom_point(mapping = aes(color = R), size = 2.0) +
  facet_wrap(~B_range, labeller = 'label_both') +
  geom_density_2d() +  # Add contour lines to represent point density
  theme_bw()

Visualize the predicted event probability of HSL:

viz_gamt_df %>% 
  ggplot(mapping = aes(x = Hue, y = popular)) +
  geom_point(mapping = aes(color = Lightness), size = 2.0) +
  facet_wrap(~Saturation, labeller = 'label_both') +
  geom_density_2d() + 
  theme_bw()

K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a simple yet effective algorithm used for both classification and regression tasks. It works by identifying the K nearest data points to a given query point and predicting the class or value based on the most common class or average value among its neighbors, respectively.

We now train the KNN model using the train function from the caret package. We specify the method as “knn” and use accuracy as the evaluation metric.

set.seed(1234)

knn_default <- train(outcome ~ .,
                     data = df3_caret,
                     method = "knn",
                     trControl = my_ctrl,
                     tuneLength = 10)  # Set the desired value of K

knn_default
## k-Nearest Neighbors 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.8075931  0.4400930
##    7  0.8155969  0.4443954
##    9  0.8032517  0.3946884
##   11  0.8036440  0.3817466
##   13  0.7976579  0.3550789
##   15  0.7951959  0.3369989
##   17  0.7991455  0.3349957
##   19  0.7959851  0.3105241
##   21  0.7951911  0.2982999
##   23  0.7915671  0.2766259
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

The KNN default accuracy is 0.8075931 with k = 5.

From the above outcomes we can see that,

The k-Nearest Neighbors (KNN) algorithm was applied to classify samples into ‘popular_paint’ and ‘non_popular_paint’ classes using a dataset with 835 samples and 6 predictor variables. Through cross-validated performance evaluation, KNN was tuned over a range of K values from 5 to 23. The optimal model achieved an accuracy of approximately 80.48% with a K value of 9. This indicates that when considering the 9 nearest neighbors, the model correctly predicts the class label for about 80.48% of the samples. KNN’s simplicity and effectiveness make it a valuable tool for classification tasks, especially when dealing with relatively small datasets.

Predictions to understand the behavior of the model

# Predictions to understand the behavior of the Gradient Boosted Trees model
pred_viz_knn_probs <- predict(knn_default, newdata = viz_grid, type = 'prob')

# Combine predictions with visualization grid
viz_knn_df <- viz_grid %>% bind_cols(pred_viz_knn_probs)

# Display a glimpse of the combined dataframe
viz_knn_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular  <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…

Plot for nb_default

Lets try to analyze it by a graph

plot(knn_default,main="KNN Default Plot", xTrans=log)

The plot showcases the relationship between the number of neighbors (K) considered in the k-Nearest Neighbors (KNN) algorithm and the corresponding classification accuracy. As the number of neighbors increases from 0 to 4 along the x-axis, the accuracy initially declines until it reaches its lowest point around 2.6. Subsequently, there’s a gradual rise in accuracy until approximately 2.8 neighbors, followed by a slight decrease. Notably, the highest accuracy is achieved when K equals 1, denoting that the model performs optimally when considering only the nearest neighbor for classification. This pattern suggests a trade-off between the complexity of the model (as determined by the number of neighbors) and its predictive accuracy, with an optimal balance observed at K=1.

Tuning KNN

# Define a tuning grid for KNN
knn_grid <- expand.grid(k = seq(1, 20, by = 2))  # Define the range of K values

set.seed(1234)

knn_tune <- train(outcome ~ .,
                  data = df3_caret,
                  method = "knn",
                  trControl = my_ctrl,
                  tuneGrid = knn_grid)

knn_tune
## k-Nearest Neighbors 
## 
## 835 samples
##   6 predictor
##   2 classes: 'popular', 'unpopular' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7984282  0.4366710
##    3  0.8095540  0.4562571
##    5  0.8087932  0.4434316
##    7  0.8155969  0.4443954
##    9  0.8032565  0.3952688
##   11  0.8032472  0.3809159
##   13  0.7976579  0.3550789
##   15  0.7955975  0.3387744
##   17  0.7987487  0.3335358
##   19  0.7967883  0.3138451
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

The KNN tuned accuracy is 0.8155969 with k = 7, which is slightly greater than default.

Predictions to understand the behavior of the tuned knn model

# Predictions to understand the behavior of the Gradient Boosted Trees model
pred_viz_knn_tune_probs <- predict(knn_tune, newdata = viz_grid, type = 'prob')

# Combine predictions with visualization grid
viz_knn_tune_df <- viz_grid %>% bind_cols(pred_viz_knn_tune_probs)

# Display a glimpse of the combined dataframe
viz_knn_tune_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular  <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…

Plotting Tuned model

# Plot the performance of Naive Bayes model

plot(knn_default, main="KNN Tuned Plot", xTrans=log)

Predictions

# Predictions for Naive Bayes default model
pred_viz_knn_probs_default <- predict(knn_default, newdata = viz_grid, type = 'prob')

# Combine predictions with the visualization grid
viz_knn_df_default <- bind_cols(viz_grid, as.data.frame(pred_viz_knn_probs_default))

# Glimpse the combined dataframe
glimpse(viz_knn_df_default)
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular  <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…
# Predictions for tuned Naive Bayes model
pred_viz_knn_probs_tune <- predict(knn_tune, newdata = viz_grid, type = 'prob')

# Combine predictions with the visualization grid
viz_knn_df_tune <- bind_cols(viz_grid, as.data.frame(pred_viz_knn_probs_tune))

# Glimpse the combined dataframe
glimpse(viz_knn_df_tune)
## Rows: 784
## Columns: 8
## $ R          <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G          <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B          <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness  <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue        <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular    <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular  <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…

Visulatization

viz_knn_df_default %>% 
  ggplot(mapping = aes(x = Hue, y = popular, color=B)) +
  geom_point(size = 3) +
  facet_wrap(~Lightness, labeller = 'label_both') +
  scale_color_gradient(low = "#00FFFF", high = "#104E8B") +
  theme_bw()

Tuned Predicted Probability of outcome for Naive Bayes

viz_grid %>% 
  bind_cols(predict(knn_tune, newdata = viz_grid, type = 'prob')) %>% 
  ggplot(mapping = aes(x = Hue, y = popular, color = B)) +
  geom_point(size = 3) +  
  facet_wrap(~Lightness, labeller = 'label_both') +
  scale_color_gradient(low = "#00FFFF", high = "#104E8B") +
  theme_bw()

For most of the models trained using neural network, random forests, GBM, GAM and KNN, I have used Accuracy as the performance metric. I was unaware of AOC ROC curve until I completed all the models and went to the next slide.

  1. Accuracy as the Evaluation Metric:
    • Accuracy is a straightforward metric that measures the proportion of correctly classified instances out of the total instances.
    • It is easy to interpret and understand, making it a popular choice for classification tasks.
    • Especially in balanced datasets (where the classes are roughly equal in size), accuracy provides a good overall measure of model performance.
  2. k-Fold Cross-Validation:
    • k-Fold Cross-Validation is a robust resampling technique for estimating the performance of a predictive model.
    • It divides the dataset into k equal-sized folds and iteratively trains the model on k-1 folds while using the remaining fold for validation.
    • This process is repeated k times, with each fold used exactly once as the validation set.
    • It helps in reducing bias and variance in model evaluation by using multiple train-test splits of the dataset.
    • k-Fold Cross-Validation provides a more accurate estimate of model performance compared to a single train-test split, especially when the dataset is limited.

By combining accuracy as the evaluation metric with k-fold cross-validation as the resampling scheme, you ensure that: - The model is evaluated based on its ability to correctly classify instances. - The model’s performance is assessed robustly across multiple train-test splits of the data, reducing the risk of overfitting or underfitting.

Accuracy Comparison:

KNN Model: - The tuned KNN model achieved an accuracy of 0.8155969 with k = 7, slightly higher than the default.

GAM Model: - The accuracy for the tuned GAM model is 0.8622984. There was no change in accuracy before and after tuning.

Neural Network Models: - Default Neural Network: - Size = 3, Decay = 1e-01 - Accuracy: 0.8331720, Kappa: 0.4839001

  • Tuned Neural Network:
    • Size = 20, Decay = 0.135335283
    • Accuracy: 0.8363562, Kappa: 0.5176417
    • The tuned neural network outperforms the default model.

Random Forest Models: - Default Random Forest: - Accuracy: 0.8578284, Kappa: 0.5708417 (mtry = 16)

  • Tuned Random Forest:
    • Accuracy: 0.8583507, Kappa: 0.5599688 (mtry = 6)
    • The tuned model shows a slight improvement in accuracy compared to the default.

GBM Model: - Default GBM: - Accuracy: 0.8554810 - Final values: n.trees = 100, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10

  • Tuned GBM:
    • Tuned Accuracy: 0.8599150, Kappa: 0.5796735
    • Final values: n.trees = 200, interaction.depth = 6, shrinkage = 0.1, n.minobsinnode = 5
    • The tuned model’s accuracy slightly exceeds that of the default.
library(ggplot2)

# Create a data frame with model names and accuracy values
data <- data.frame(Model = c("KNN", "GAM", "Neural Network", "Random Forest", "GBM"),
                   Accuracy = c(0.8155969, 0.8622984, 0.8363562, 0.8583507, 0.8599150))

# Create the ggplot
ggplot(data, aes(x = Model, y = Accuracy)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = round(Accuracy, 4)), vjust = -0.3) +  # Add text labels for accuracy values
  labs(title = "Accuracy of Different Models", x = "Models", y = "Accuracy") +
  ylim(0.8, 0.9) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability
## Warning: Removed 5 rows containing missing values (`geom_bar()`).

GAM is the best model according to the accuracy.

gam_tune %>% readr::write_rds("gam_tune.rds")